Adversarial Semi-Supervised Audio Source Separation applied to Singing Voice Extraction
The state of the art in music source separation employs neural networks
trained in a supervised fashion on multi-track databases to estimate the
sources from a given mixture. With only few datasets available, often extensive
data augmentation is used to combat overfitting. Mixing random tracks, however,
can even reduce separation performance as instruments in real music are
strongly correlated. The key concept in our approach is that source estimates
of an optimal separator should be indistinguishable from real source signals.
Based on this idea, we drive the separator towards outputs deemed as realistic
by discriminator networks that are trained to tell apart real from separator
samples. This way, we can also use unpaired source and mixture recordings
without the drawbacks of creating unrealistic music mixtures. Our framework is
widely applicable as it does not assume a specific network architecture or
number of sources. To our knowledge, this is the first adoption of adversarial
training for music source separation. In a prototype experiment for singing
voice separation, separation performance increases with our approach compared
to purely supervised training.
Comment: 5 pages, 2 figures, 1 table. Final version of manuscript accepted for
2018 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). Implementation available at
https://github.com/f90/AdversarialAudioSeparatio
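The adversarial idea in the abstract above can be illustrated with a minimal sketch. It assumes the standard GAN cross-entropy objective and a mean-squared-error supervised term; the abstract specifies neither, so the loss forms, the function names, and the weight `alpha` are hypothetical and not taken from the paper or its implementation.

```python
import math

EPS = 1e-12  # avoids log(0)


def mse(estimate, target):
    """Supervised loss on a paired (mixture, source) example."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)


def discriminator_loss(d_real, d_fake):
    """Cross-entropy loss for a discriminator.

    d_real: probabilities assigned to real source recordings.
    d_fake: probabilities assigned to separator outputs.
    The loss is low when real samples score high and separator samples score low.
    """
    real_term = -sum(math.log(p + EPS) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p + EPS) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def separator_loss(estimate, target, d_on_unpaired, alpha=0.1):
    """Supervised term on paired data plus an adversarial term on unpaired mixtures.

    d_on_unpaired: discriminator probabilities for separator outputs computed
    from mixtures that have no ground-truth sources. alpha is an assumed weight.
    """
    adversarial = -sum(math.log(p + EPS) for p in d_on_unpaired) / len(d_on_unpaired)
    return mse(estimate, target) + alpha * adversarial
```

In this sketch the separator is rewarded when the discriminator rates its outputs on unpaired mixtures as realistic, which is how unpaired recordings can contribute to training without constructing artificial mixtures.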
Contrastive Learning-Based Audio to Lyrics Alignment for Multiple Languages
Lyrics alignment has gained considerable attention in recent years.
State-of-the-art systems either re-use established speech recognition toolkits,
or design end-to-end solutions involving a Connectionist Temporal
Classification (CTC) loss. However, both approaches suffer from specific
weaknesses: toolkits are known for their complexity, and CTC systems use a loss
designed for transcription which can limit alignment accuracy. In this paper,
we use instead a contrastive learning procedure that derives cross-modal
embeddings linking the audio and text domains. This way, we obtain a novel
system that is simple to train end-to-end, can make use of weakly annotated
training data, jointly learns a powerful text model, and is tailored to
alignment. The system is not only the first to yield an average absolute error
below 0.2 seconds on the standard Jamendo dataset but it is also robust to
other languages, even when trained on English data only. Finally, we release
word-level alignments for the JamendoLyrics Multi-Lang dataset.
Comment: 5 pages, accepted at the International Conference on Acoustics,
Speech, and Signal Processing (ICASSP) 202
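As a rough illustration of the contrastive procedure mentioned in the abstract above, the sketch below computes an InfoNCE-style loss between one audio embedding and several text embeddings. The abstract does not give the loss, so the formulation, the function names, and the `temperature` value are assumptions for illustration only.

```python
import math


def dot(u, v):
    """Similarity score between two embeddings."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(audio_emb, text_pos, text_negs, temperature=0.1):
    """InfoNCE-style contrastive loss (an assumed formulation).

    audio_emb: embedding of an audio frame.
    text_pos:  embedding of the text token that matches the frame.
    text_negs: embeddings of tokens that do not match.
    The loss is small when the matching pair scores higher than all others.
    """
    logits = [dot(audio_emb, text_pos) / temperature]
    logits += [dot(audio_emb, neg) / temperature for neg in text_negs]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]
```

Once audio and text live in a shared embedding space, a similarity matrix between audio frames and lyric tokens can serve as the basis for alignment; that step is not shown here.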
Deep Learning for Music Information Retrieval in Limited Data Scenarios.
PhD Thesis
While deep learning (DL) models have achieved impressive results in settings
where large amounts of annotated training data are available, overfitting often
degrades performance when data is more limited. To improve the generalisation
of DL models, we investigate "data-driven priors" that exploit additional unlabelled
data or labelled data from related tasks. Unlike techniques such as data
augmentation, these priors are applicable across a range of machine listening
tasks, since their design does not rely on problem-specific knowledge.
We first consider scenarios in which parts of samples can be missing, aiming to
make more datasets available for model training. In an initial study focusing on
audio source separation (ASS), we exploit additionally available unlabelled music
and solo source recordings by using generative adversarial networks (GANs),
resulting in higher separation quality. We then present a fully adversarial
framework for learning generative models with missing data. Our discriminator
consists of separately trainable components that can be combined to train the
generator with the same objective as in the original GAN framework. We apply
our framework to image generation, image segmentation and ASS, demonstrating
superior performance compared to the original GAN.
To improve performance on any given MIR task, we also aim to leverage
datasets which are annotated for similar tasks. We use multi-task learning (MTL)
to perform singing voice detection and singing voice separation with one model,
improving performance on both tasks. Furthermore, we employ meta-learning
on a diverse collection of ten MIR tasks to find a weight initialisation for a
"universal MIR model" so that training the model on any MIR task with this
initialisation quickly leads to good performance.
Since our data-driven priors encode knowledge shared across tasks and
datasets, they are suited for high-dimensional, end-to-end models, instead of small
models relying on task-specific feature engineering, such as fixed spectrogram
representations of audio commonly used in machine listening. To this end, we
propose "Wave-U-Net", an adaptation of the U-Net, which can perform ASS
directly on the raw waveform while performing favourably to its spectrogram-based
counterpart. Finally, we derive "Seq-U-Net" as a causal variant of Wave-U-Net,
which performs comparably to Wavenet and Temporal Convolutional
Network (TCN) on a variety of sequence modelling tasks, while being more
computationally efficient.
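The "causal" property mentioned for Seq-U-Net in the abstract above can be illustrated with a minimal sketch of a causal 1D convolution, in which each output sample depends only on current and past input samples. This is a generic illustration, not the thesis architecture; the function name and the left zero-padding scheme are assumptions.

```python
def causal_conv1d(signal, kernel):
    """Causal 1D convolution via left zero-padding.

    The last kernel weight multiplies the current sample and earlier weights
    multiply earlier samples, so output[t] never looks at signal[t+1:].
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(signal)
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(signal))
    ]


# Changing a future sample leaves earlier outputs untouched.
a = causal_conv1d([1.0, 2.0, 3.0], [0.5, 0.5])
b = causal_conv1d([1.0, 2.0, 9.0], [0.5, 0.5])
print(a)  # [0.5, 1.5, 2.5]
print(a[:2] == b[:2])  # True
```

Causality of this kind is what allows a model to be used for sequence modelling tasks where future inputs are not available at prediction time.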